The block state variable form is investigated
as a technique to increase the parallelism of
a filter. This increase in parallelism allows
more parallel processors to be usefully applied
to the problem, resulting in a faster processing
rate than is possible in the unblocked form.
Upper and lower bounds on the sample period bound
and the number of processors required to support
it are determined.


INTRODUCTION

In digital filtering applications where the
maximum processing rate is of fundamental
importance, in particular real-time processing,
higher rates can be achieved by faster processors
or more parallel processors. For many problems
faster hardware is not practical or cost effective
compared to simple multiprocessor solutions,
particularly for VLSI implementations.

Recurrence relations, such as recursive
filters, specified by 'fully specified signal
flow graphs,' have been shown to have a maximum
parallelism that is constrained by one or more
'critical loops.' Adding additional processors,
beyond the maximum parallelism, performs no
directly useful work. However, it is possible to
increase the parallelism of the problem by
transformation to a block form.

This paper concentrates on the block state
variable form. Any particular fully specified
member of this class of filters has a well defined
sample period bound, and any particular filter
has a specific fully specified form which results
in the minimum sample period bound. Determination
of the exact bound requires a lengthy search
operation. However, the determination of the
sample period bound can itself be bounded by the
gross properties of the system matrix. This
paper explores the block state variable form and
determines an upper and lower bound on the
'sample period bound,' and the associated number
of processors required. It is also shown that
for many problems the blocked form has lower
computational requirements, and decreased finite
word effects, even if evaluated on a typical
sequential uniprocessor.
Flow Graph Specification

A fully specified flow graph is a generalised
flow graph in which the node operations are all
fundamental operations of the constituent
processor on which the algorithm will be implemented
[1]. The definition of the node operations in
the fully specified flow graph sets the
granularity with which the parallelism can be
exploited.

Flow Graph Bounds

Given a fully specified flow graph it is
possible to compute the lower bound on the sample
period (the sample period bound, or the rate bound,
which is its reciprocal), and this bound is always
achievable. The sample period bound is best
understood in the context of a recursive
single-time-index flow graph (e.g. an IIR digital
filter), although the concept is meaningful in
systems which have no explicit sample period.

For such systems the sample period bound is
given by

$$ T_0 = \max_i \left\{ \frac{D_i}{n_i} \right\} \qquad (1) $$

where $i$ varies over the set of all loops, and
$D_i$, the total delay around loop $i$, is

$$ D_i = \sum_{l \,\in\, \text{loop } i} d_l \qquad (2) $$

The computational time to perform the operation
of node $l$ is $d_l$, and $n_i$ is the number of delays
in loop $i$. This is a generalisation of a result
published by Renfors and Neuvo [2].

Any loop for which $T_i = D_i/n_i = T_0$ is
considered a critical loop.

Let $D$ be the total computational delay of
all the nodes:

$$ D = \sum_l d_l \qquad (3) $$

Then the maximum parallelism, or number of
processors in a 'processor optimal' solution, is the
total delay divided by the sample period bound:

$$ P = D / T_0 \qquad (4) $$

The maximum parallelism thus defined is the
maximum parallelism such that at all instances P
operations can be performed in parallel. It is
important to note that this is not the same
concept as the maximum number of parallel operations
that can be achieved by a 'greedy scheduler,' in
which each operation is performed as soon as it
is possible. Rather, it is a constant level of
parallelism which allows for exactly P operations
to be performed on every cycle. Another important
point to restate is that using more than
P processors will not decrease the sample period
bound.
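
The loop-based definitions above can be sketched in a few lines. This is a minimal illustration, not the procedure used in the paper: it assumes the loops of the flow graph have already been enumerated, and the function names and the numerical delays in the example are hypothetical.

```python
from fractions import Fraction

def loop_ratio(node_delays, n_delays):
    """Total computational delay around one loop divided by its delay count."""
    return Fraction(sum(node_delays), n_delays)

def sample_period_bound(loops):
    """Largest delay-to-register ratio over all loops.

    loops: list of (node_delays, n_delays) pairs, one pair per loop.
    """
    return max(loop_ratio(d, n) for d, n in loops)

def critical_loops(loops):
    """Indices of the loops whose ratio equals the sample period bound."""
    t0 = sample_period_bound(loops)
    return [i for i, (d, n) in enumerate(loops) if loop_ratio(d, n) == t0]

def max_parallelism(all_node_delays, loops):
    """Total delay of every node divided by the sample period bound."""
    return Fraction(sum(all_node_delays)) / sample_period_bound(loops)

# Hypothetical example: multiply takes 2 time units, add takes 1.
# Loop 0: one multiply and one add around a single delay.
# Loop 1: one multiply and two adds around two delays.
example_loops = [([2, 1], 1), ([2, 1, 1], 2)]
# Hypothetical graph with four multiplies and four adds in total.
example_nodes = [2, 2, 2, 2, 1, 1, 1, 1]

print(sample_period_bound(example_loops))
print(critical_loops(example_loops))
print(max_parallelism(example_nodes, example_loops))
```

Exact rational arithmetic is used so that the test for a critical loop is an equality test rather than a floating point comparison.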

Optimality 

This work assumes an implementation that
meets the following optimality criteria. An
implementation is processor optimal if it exhibits
perfect processor efficiency, i.e. every cycle
of every processor is used directly on the
fundamental operations of the algorithm (flow graph)
and no cycles are used for synchronisation or
system control. If an implementation achieves
the sample period bound it is considered rate
optimal. From the previous section it is seen
that an implementation that is processor and rate
optimal requires exactly P processors.

BLOCK STATE VARIABLE FORM

Any system, $H(z)$, with a rational transfer
function can be expressed in state variable form:

$$ H(z) = C(zI - A)^{-1}B + D $$

Let $W_k$ be the state vector, $u_k$ the input and $y_k$
the output at time $k$. The state equation is then
the familiar:

$$ \Sigma: \quad W_{k+1} = A W_k + B u_k $$

$$ y_k = C W_k + D u_k \qquad (5) $$

For simplicity and clarity this paper will
only consider single input, single output (SISO)
systems. The generalisation is straightforward.
For the case of a system of order $N$, $A$
is $N \times N$, $B$ is $N \times 1$, $C$ is $1 \times N$ and $D$ is a scalar.

The original scalar system, $\Sigma$, can be
converted to a block form system, $\Sigma_L$, that operates
on a block of $L$ (sequential) inputs and produces
$L$ (sequential) outputs in parallel. Our goal is
to show that as the block size increases the
sample period bound decreases.

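The new system, $\Sigma_L$, is defined in terms of
the original system as follows:

$$ \hat{A} = A^L, \qquad \hat{B} = [\, A^{L-1}B \mid A^{L-2}B \mid \cdots \mid AB \mid B \,] $$

$$ U_k = [\, u_{kL} \;\; u_{kL+1} \;\; \cdots \;\; u_{kL+L-1} \,]^T $$

$$ Y_k = [\, y_{kL} \;\; y_{kL+1} \;\; \cdots \;\; y_{kL+L-1} \,]^T \qquad (6) $$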
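
The block state update can be checked numerically. The sketch below is an illustration under stated assumptions, not code from the paper: it uses plain Python lists as matrices, the helper names are hypothetical, and only the recursive (state) part of the block system is built. It confirms that one block step with the L-th power of A and the stacked input matrix reproduces L steps of the scalar recursion.

```python
from fractions import Fraction

def mat_mul(X, Y):
    """Product of two matrices stored as lists of rows."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def mat_pow(A, e):
    """A raised to a non-negative integer power by repeated multiplication."""
    n = len(A)
    R = [[1 if i == j else 0 for j in range(n)] for i in range(n)]
    for _ in range(e):
        R = mat_mul(R, A)
    return R

def block_state_matrices(A, B, L):
    """Return (A_hat, B_hat): A^L and [A^(L-1)B | ... | AB | B].

    B is the input column, given as a list of N numbers.
    """
    n = len(A)
    cols = []
    for power in range(L - 1, -1, -1):
        P = mat_pow(A, power)
        cols.append([sum(P[i][k] * B[k] for k in range(n)) for i in range(n)])
    B_hat = [[cols[j][i] for j in range(L)] for i in range(n)]
    return mat_pow(A, L), B_hat

def scalar_step(A, B, w, u):
    """One step of the unblocked state update."""
    n = len(A)
    return [sum(A[i][k] * w[k] for k in range(n)) + B[i] * u for i in range(n)]

def block_step(A_hat, B_hat, w, u_block):
    """One step of the blocked state update on a block of L inputs."""
    n = len(A_hat)
    return [sum(A_hat[i][k] * w[k] for k in range(n))
            + sum(B_hat[i][j] * u_block[j] for j in range(len(u_block)))
            for i in range(n)]

# Hypothetical second-order example with block size 3.
A = [[0, 1], [Fraction(-1, 4), Fraction(1, 2)]]
B = [0, 1]
w0 = [Fraction(1), Fraction(2)]
u = [1, 0, -1]

w_scalar = w0
for sample in u:
    w_scalar = scalar_step(A, B, w_scalar, sample)

A_hat, B_hat = block_state_matrices(A, B, 3)
w_block = block_step(A_hat, B_hat, w0, u)
print(w_scalar == w_block)
```

Rational coefficients are used only so that the two results can be compared exactly.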
The poles of $\Sigma$ and $\Sigma_L$ are the eigenvalues of
$A$ and $\hat{A}$ respectively, denoted $\{\lambda_1, \lambda_2, \ldots, \lambda_N\}$
and $\{\hat{\lambda}_1, \hat{\lambda}_2, \ldots, \hat{\lambda}_N\}$. Since $\hat{A} = A^L$, then $\hat{\lambda}_i = \lambda_i^L$. As
the block size $L$ increases, the poles of $\Sigma_L$
spiral into the origin. This leads to increased
stability and decreased coefficient quantisation
error since the coded coefficients are those of
$\hat{A}$. Given fixed point implementation with finite
precision and a sufficiently large block size,
$L > N$, $\Sigma_L$ reduces to an FIR (no recursion) system,
which has unbounded parallelism. The sampling
rate of an FIR system is bounded only by available
resources and tolerable throughput delay
[3].
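
As a small numerical illustration of the poles moving toward the origin, the sketch below raises a single stable pole to successive block sizes. The pole value and the function name are hypothetical choices made for the example.

```python
import cmath

def blocked_poles(poles, L):
    """Poles of the blocked system: each original pole raised to the L-th power."""
    return [p ** L for p in poles]

# Hypothetical stable pole at radius 0.9 and angle 45 degrees.
pole = cmath.rect(0.9, cmath.pi / 4)

for L in (1, 2, 4, 8):
    (q,) = blocked_poles([pole], L)
    # The radius shrinks as 0.9**L while the angle is multiplied by L.
    print(L, round(abs(q), 4), round(cmath.phase(q), 4))
```

The radius of each pole is raised to the power L, so any pole strictly inside the unit circle moves closer to the origin as the block size grows.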

What of the parallelism of a blocked system
with block size less than $N$? A difficulty with
the block state variable (state variable) form is
that while the $A$ matrix defines the form of the
recursive part of the network, it does not
specify the order of the additions. To rephrase, the
state equations only define the generic signal
flow graph. Recall that only a fully specified
signal flow graph has a sample period bound and
that a generic graph may have a large number of
different fully specified graphs with different
associated bounds. A good example is a
non-blocked, $N$th order direct form canonic filter.
The optimal fully specified flow graph has a
sample period bound of $t_a + t_m$ (add time + multiply
time), while the worst case fully specified graph
has a bound of $(N-1)t_a + t_m$. For a specific
generic graph, it is straightforward to find the
fully specified form with the lowest possible
sample period bound using an iterative tree
height balancing algorithm.
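
To see why the order of the additions matters, the sketch below compares the adder depth of the same n-term sum when it is accumulated as a chain and when it is arranged as a balanced tree. It is only an illustration of the effect, not the iterative tree height balancing algorithm mentioned above, and the function names are hypothetical.

```python
def chain_depth(n_terms):
    """Adder depth when the terms are accumulated one after another."""
    return max(n_terms - 1, 0)

def balanced_depth(n_terms):
    """Adder depth when the terms are summed pairwise in a balanced tree."""
    depth = 0
    while n_terms > 1:
        n_terms = (n_terms + 1) // 2  # each level halves the operand count
        depth += 1
    return depth

for n in (2, 3, 5, 8, 16):
    print(n, chain_depth(n), balanced_depth(n))
```

Both orderings compute the same sum with the same number of two-input additions; only the longest path through the adders, and therefore the delay it contributes to a loop, differs.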

Despite these difficulties, it is still
possible to specify an upper and lower bound on
the sample period bound of the state equations.
Consider the block state system. Computation
of $Y_k$ given $W_k$ is non-recursive and can be
overlapped with following blocks, if necessary. The
matrix product $\hat{B} U_k = V_k$ can be precomputed, over
preceding blocks, if necessary, resulting in a
new simple input vector. Thus, the sample period
is determined by the recursive portion plus a
simple input. Examining the form of the update
equations it can be seen that the update of each
state variable can proceed in parallel. This
viewpoint leads to the determination of the upper
bound on the sample period bound.

In general for the block state form,
occurrences of zeroes in the $V_k$ vector are rare for
non-zero input sequences. All of the
multiplications $(\hat{A})_{ij}(W_k)_j$ can proceed in parallel,
contributing a delay of $t_m$. Summing the products
of row $i$ of $\hat{A}$ with $W_k$, plus the input term $(V_k)_i$,
with a balanced tree adder, introduces a delay
of $\lceil \log_2(n_i + 1) \rceil t_a$, where $n_i$ is the number of non-zero
coefficients in row $i$ of $\hat{A}$. The upper bound on
the sample period is therefore determined by the
row of $\hat{A}$ with the most non-zero coefficients.
To determine the lower bound, recall that
what determines the bounds are loops. The update
of the state variable $(W_k)_i$ is a weighted sum of
all state variables. If, in the computation of
$(W_k)_i$, $(W_k)_j$ does not form a loop with $(W_k)_i$, then
the weighted sum of all $(W_k)_j$, $j \in J$ ($J$ is the set
of indexes $j$ such that $(W_k)_j$ does not form a loop
with $(W_k)_i$), can be precomputed as a single input
($k_i = \sum_{j \in J} (\hat{A})_{ij}(W_k)_j$). This leads to at least one of
the state variables not containing a
precomputable partial weighted sum. Therefore, the lower
bound must be greater than or equal to that
associated with the row of $\hat{A}$ with the least
coefficients. However, since it is possible that
the critical loop contains a $p$ unit delay instead
of a unit delay, it is necessary to divide the
computational delay by $p$ to yield the sample
period bound. A necessary condition for a $p$ unit
delay to exist in a critical loop is that $\hat{A}$
contains $p$ rows with precisely one non-zero
coefficient.

For block form systems, what is of main
interest is not the sample period bound, but the
sample period bound per output sample. This is
just the sample period bound divided by the block
size, which yields the average time between
successive output samples. The per output
qualification hereafter is implied when referring to
the sample period bound, unless stated
otherwise. The bounds on the sample period bound are
therefore given as follows (for the original
unblocked system substitute $L = 1$ and $A$ for $\hat{A}$):


$$ \frac{\lceil \log_2(\min\{n_i\} + 1) \rceil t_a + t_m}{pL} \;\le\; T_0 \;\le\; \frac{\lceil \log_2(\max\{n_i\} + 1) \rceil t_a + t_m}{L} \qquad (7) $$


where $n_i$ is the number of non-zero coefficients
in row $i$, $p$ is the number of rows of $\hat{A}$ with
exactly one non-zero coefficient and $L$ is the
block size.
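
The bounds can be evaluated directly from the sparsity pattern of the blocked system matrix. The sketch below follows the description in the text: one multiply delay plus a balanced adder tree over the non-zero entries of a row and one input term, divided by the block size, with the lower bound additionally divided by p. Treating p = 0 as 1 is an assumption made here to keep the function defined, and the function names are hypothetical.

```python
from fractions import Fraction

def ceil_log2(x):
    """Smallest integer k with 2**k >= x, for integer x >= 1."""
    return (x - 1).bit_length()

def row_nonzeros(A_hat):
    """Number of non-zero coefficients in each row of the matrix."""
    return [sum(1 for c in row if c != 0) for row in A_hat]

def sample_period_bounds(A_hat, L, t_a, t_m):
    """Return (lower, upper) bounds on the per-output sample period bound."""
    n = row_nonzeros(A_hat)
    p = sum(1 for count in n if count == 1)
    p = max(p, 1)  # assumption: no single-entry rows means a unit delay
    lower = Fraction(ceil_log2(min(n) + 1) * t_a + t_m, p * L)
    upper = Fraction(ceil_log2(max(n) + 1) * t_a + t_m, L)
    return lower, upper

# Hypothetical fully populated 4x4 matrix, block size 4, t_a = 1, t_m = 2.
full = [[1] * 4 for _ in range(4)]
print(sample_period_bounds(full, 4, 1, 2))

# Hypothetical block diagonal matrix built from two 2x2 blocks, block size 4.
block_diag = [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 1, 1], [0, 0, 1, 1]]
print(sample_period_bounds(block_diag, 4, 1, 2))
```

When every row has the same number of non-zero coefficients and no row has exactly one, the two bounds coincide.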

If the system is not a parallel or serial
cascade then blocking the system with a block
size of $L \ge N-2$ typically results in a system with
no (or very few) zero coefficients. This
results in the worst case sample period bound of:

$$ T_0 = \frac{\lceil \log_2(N + 1) \rceil t_a + t_m}{L} \qquad (8) $$
Computational Requirements

The computational requirements in terms of
the number of operations and number of required
processors are derived by assuming a
straightforward implementation of the state variable
equations. It is further assumed that the system
is of state space form, and has no zero
coefficients (worst case). The constituent processors
are assumed to have kernel operations of 'two
input addition' and 'multiplication.'

The number of multiplications is the number
of non-zero coefficients in $A$, $B$, $C$ and $D$ ($\hat{A}$, $\hat{B}$,
$\hat{C}$ and $\hat{D}$). The number of additions is $n-1$ for
each $n$ element row by column inner product and $m$
for each addition of $m$ element vectors.
Therefore the number of multiplies per output and the
number of additions per output are given by:

$$ \text{Mult/output} = \frac{N^2}{L} + 2N + \frac{L+1}{2} \qquad (9) $$

$$ \text{Add/output} = \frac{N(N-1)}{L} + 2N + \frac{L-1}{2} \qquad (10) $$
As can be seen from Fig. 1, for block sizes
less than approximately $2N^2$, the total number of
multiplications is less than for the nonblocked
or $L = 1$ form. The minimum for the number of
multiplies per output occurs for a block size of
$L = \sqrt{2}\,N$. The graphs for additions are nearly
identical to those for the multiplications, with
the minima occurring at $L = \sqrt{2N(N-1)}$. More sparse
realisations may have less significant savings in
total operations.
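
The stated minima can be checked by direct counting. The sketch below assumes a fully populated blocked system in which the state matrix is N by N, the input matrix is N by L, the output matrix is L by N and the feedthrough matrix is L by L lower triangular; these shapes, and the function names, are assumptions made for this illustration rather than details given explicitly above.

```python
from fractions import Fraction

def mults_per_output(N, L):
    """Multiplies per output sample for the fully populated block form."""
    per_block = N * N + 2 * N * L + Fraction(L * (L + 1), 2)
    return per_block / L

def adds_per_output(N, L):
    """Two-input additions per output sample for the fully populated block form."""
    per_block = N * (N - 1) + 2 * N * L + Fraction(L * (L - 1), 2)
    return per_block / L

def best_block_size(cost, N, L_max):
    """Integer block size in 1..L_max that minimises cost(N, L)."""
    return min(range(1, L_max + 1), key=lambda L: cost(N, L))

N = 8
L_mult = best_block_size(mults_per_output, N, 4 * N * N)
L_add = best_block_size(adds_per_output, N, 4 * N * N)
print(L_mult, float(mults_per_output(N, L_mult)), float(mults_per_output(N, 1)))
print(L_add, float(adds_per_output(N, L_add)), float(adds_per_output(N, 1)))
```

For N = 8 the integer minimisers fall next to the continuous values (the square root of 2 times N for multiplies, and the square root of 2N(N-1) for adds), and the multiply count returns to its unblocked value at a block size of 2N squared.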

Making the assumption that $t_a = t_m$ allows
for a simpler determination of the number of
processors, or parallelism, from the number of
operations. As in the previous portions, this
result is for the fully populated state space
form, which is known to have processor and rate
optimal solutions. The number of processors is
equal to the total arithmetic delay divided by
the sample period bound:

$$ P = \frac{D}{T_0} \qquad (12) $$
For block systems the order of the number of
processors required is roughly proportional to
the block size squared (since $N$ is fixed).
Combining equations (8) and (12) for $N > 1$, the
number of normalised processors as a function of
the normalised rate bound is shown in Fig. 2.
Normalisation in this case implies that for $L = 1$,
one normalised processor processes at a normalised
rate of one. Thus the graph indicates the
relative cost of a given rate increase.

Block Normal Form

Increasing the block size tends to decrease
the sparseness of the system matrix $\hat{A}$, and thus
leads to larger increases in the number of
operations. If the unblocked system is of block
diagonal form, the blocked system matrix is of
the same block diagonal form, with no attendant
decrease in sparseness. While the first form
that may occur to the reader is the Jordan normal
form, this implies complex arithmetic which leads
to greater complexity and an increased sample
period bound ($t_{m\text{-complex}} = t_{m\text{-real}} + t_{a\text{-real}}$).
The parallel cascade of second order normal form
sections leads to an attractive block diagonal
form. The block normal form is particularly
attractive in that Barnes [4] has shown that 1)
average roundoff noise is decreased by a factor
of $L$, 2) for $L$ sufficiently large all autonomous
limit cycles can be eliminated, 3) minimum noise
unblocked forms lead to minimum noise blocked
forms and 4) scaling for fixed point
implementations of the unblocked system results in a
blocked system with proper scaling.

To determine the sample period bound and
parallelism of a parallel normal form consider an
$N$th order ($N$ even) system with block size $L$. The
system matrix for this case is block diagonal
with each block being a 2x2 submatrix with
non-zero coefficients. Since each row of $\hat{A}$ has
exactly two non-zero coefficients, the upper and
lower bounds on the sample period bound are the
same. Therefore the sample period bound is given
by:


$$ T_0 = \frac{\lceil \log_2 3 \rceil t_a + t_m}{L} = \frac{2t_a + t_m}{L} \qquad (13) $$


Note that this system exhibits direct linear
speedup with block size.

Counting operations per output sample:

$$ \text{Mult/output} = \frac{2N(L+1)}{L} + \frac{L+1}{2} $$

$$ \text{Add/output} = \frac{N(2L+1)}{L} + \frac{L-1}{2} \qquad (14) $$

The parallelism is then

$$ P = \frac{D}{T_0} \qquad (15) $$

The number of processors is thus of order $L^2/2$.
The number of multiplies is minimised for
$L = 2\sqrt{N}$, and the number of adds is minimised for
$L = \sqrt{2N}$.
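
The operation counts for the parallel normal form and the linear speedup of its sample period bound can be tabulated with a short sketch. The function names and the numerical values of the add and multiply times are hypothetical; the count formulas are the per-output expressions given above.

```python
from fractions import Fraction

def normal_form_mults(N, L):
    """Multiplies per output for the block normal form."""
    return Fraction(2 * N * (L + 1), L) + Fraction(L + 1, 2)

def normal_form_adds(N, L):
    """Additions per output for the block normal form."""
    return Fraction(N * (2 * L + 1), L) + Fraction(L - 1, 2)

def normal_form_period(L, t_a, t_m):
    """Per-output sample period bound: two add levels plus one multiply, over L."""
    return Fraction(2 * t_a + t_m, L)

N = 16
sizes = range(1, 65)
L_mult = min(sizes, key=lambda L: normal_form_mults(N, L))
L_add = min(sizes, key=lambda L: normal_form_adds(N, L))
print(L_mult, L_add)

# Doubling the block size halves the per-output sample period bound.
for L in (1, 2, 4, 8):
    print(L, normal_form_period(L, t_a=1, t_m=2))
```

For N = 16 the multiply count is smallest at a block size of 8, which equals twice the square root of N, and the add count is smallest at the integer nearest the square root of 2N.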


Transforming a state variable system to a
block state variable form increases the effective
parallelism and decreases the sample period
bound. The sample period bound asymptotically
approaches direct linear speedup as the block
size increases, with an attendant cost of order
$L^2$ processors. The block form not only has
better numerical properties than the unblocked
form, it may require fewer operations. Even if
the implementation is to be a sequential
uniprocessor, the numerical and complexity properties of
the block form offer significant benefits over
the unblocked form.
Fig. 1 Number of multiplies as a function of
block size for state space system of
order N.
Fig. 2 Normalised number of processors required
to achieve a normalised sample rate
increase for state space system of order
N.


